[Hybrid Allocator] Support KV cache groups with different block_size #24949
base: main
Conversation
f"num_heads ({num_heads}) is not " \ | ||
f"divisible by num_kv_heads ({num_kv_heads})" | ||
|
||
# TODO in this PR: only for testing now. remove this hardcode later |
self reminder: remove this
for kv_cache_config in kv_cache_configs:
    kv_cache_config.num_blocks = min_num_blocks
# TODO: remove this print
print("kv_cache_configs", kv_cache_configs[0])
self reminder: remove this
attn_layers = get_layers_from_vllm_config(self.vllm_config, Attention)

# TODO in this PR: revert this
def get_torch_dtype(kv_cache_dtype: str) -> torch.dtype:
self reminder: remove this and do it in a future pr
Wait, this PR only supports bf16 for full attention and fp8 for sliding window. Trying to fix fp8 for full attention and bf16 for sliding window.
block_size=32)),
],
)
Would you mind adding a test for the mixed dtype case?
# Different dtype, align by using different block size
kv_cache_specs_hybrid = {
    'layer_1': new_kv_cache_spec(dtype=torch.float8_e4m3fn),
    'layer_2': new_sliding_window_spec(dtype=torch.bfloat16),
}
kv_cache_config_hybrid = get_kv_cache_configs(
    vllm_config, [kv_cache_specs_hybrid],
    [mem_per_block_per_layer * 32])[0]
assert kv_cache_config_hybrid == KVCacheConfig(
    num_blocks=32 * 2,  # 2x blocks because baseline is BF16 (not FP32)
    kv_cache_tensors=[
        KVCacheTensor(size=mem_per_block_per_layer * 32,
                      shared_by=["layer_1", "layer_2"]),
    ],
    kv_cache_groups=[
        KVCacheGroupSpec(["layer_1"],
                         new_kv_cache_spec(dtype=torch.float8_e4m3fn,
                                           block_size=32)),
        KVCacheGroupSpec(["layer_2"],
                         new_sliding_window_spec(dtype=torch.bfloat16,
                                                 block_size=16)),
    ],
)
Similarly, this could use new_kv_cache_spec, as there is nothing specific to new_sliding_window_spec, I'd say.
Would you mind adding a test for the mixed dtype case?
I think there is no difference between mixed dtype and mixed head size from the view of this PR. Feel free to add tests when you are working on mixed dtype support.
Similarly, this could use new_kv_cache_spec, as there is nothing specific to new_sliding_window_spec, I'd say.
For models with only full attention, we can have a much simpler path because we don't need to ensure all layers have the same page_size_bytes. I'm working on it in another PR.
This pull request has merge conflicts that must be resolved before it can be merged.
Edit: I just remembered that DCP doesn't currently support FP8 KV cache, so it seems likely that that's the issue here?

Unsure if this is expected, since IIUC this PR is not yet finished, but I get a crash during startup at:

git clone --branch two_dtype_kv_cache https://github.com/heheda12345/vllm && cd vllm && git reset --hard aaf8bc9366fa270dc0b5eea81dec3a01206bd6ef
VLLM_USE_PRECOMPILED=1 uv pip install --editable .[flashinfer]
vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 4 -dcp 4 --served-model-name default --max-model-len 9216 --kv-cache-dtype fp8_e4m3

It works fine without [...]. I've only tested on a 4xH200 machine.
Thanks for catching this! I didn't try DCP yet. But why do you need this PR for DeepSeek-R1?
Hi @heheda12345, thanks for your comment. I was actually just testing this PR in case it solves a weird bug with DCP inference, seemingly related to incorrect KV cache storage/retrieval, which causes some requests to use the wrong KV data during inference when prefix caching is enabled. From your question, it sounds like this PR is not related to DeepSeek R1/V3. I was too inexperienced with this stuff to realise that ^^'
Will rebase it next week to avoid the conflict with #25101
Will handle the DCP-related crash after #26296
Purpose
The hybrid allocator currently requires all layers to have the same physical memory per block. But models like #24916 (bf16 for sliding window attention and fp8 for full attention) have different memory per block for different layers.
This PR supports these cases by giving different layers different block_sizes so that the physical memory per block is the same for all layers. For now, one layer's memory per block must be a multiple of the other's.
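To make the idea concrete, here is a minimal sketch (the helper unify_block_sizes and the byte counts are hypothetical, for illustration only, and not the PR's actual code): the layer with the largest per-token KV footprint keeps the base block_size, and layers with a smaller footprint get a proportionally larger block_size so every layer ends up with the same page_size_bytes.

# Illustrative sketch only, not the PR's implementation: pick per-layer
# block_sizes so that every layer has the same physical memory per block.
def unify_block_sizes(bytes_per_token: dict[str, int],
                      base_block_size: int = 16) -> dict[str, int]:
    # The layer with the largest per-token footprint keeps the base
    # block_size; its page size becomes the common target.
    target_page_size = max(bytes_per_token.values()) * base_block_size
    block_sizes = {}
    for layer, nbytes in bytes_per_token.items():
        # The PR currently requires one layer's memory per block to be a
        # multiple of the other's, so this division must be exact.
        assert target_page_size % nbytes == 0
        block_sizes[layer] = target_page_size // nbytes
    return block_sizes

# Example with made-up byte counts: a bf16 sliding-window layer needs 2x the
# KV bytes per token of an fp8 full-attention layer with the same shape.
print(unify_block_sizes({"sliding_window_bf16": 1024, "full_attn_fp8": 512}))
# -> {'sliding_window_bf16': 16, 'full_attn_fp8': 32}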
To support prefix caching, we need to:
1. Update get_longest_cache_hit and set the alignment requirement to the LCM of all block_sizes.
2. Generate the block hashes with block_size=cache_config.block_size.

For 2, we can generate the block hashes with a larger block_size from those with a smaller block_size. For example, from the block hashes with block_size 16, we can get the block hashes with block_size 32 by concatenating two hash values with block_size 16 into one hash value with block_size 32:
block_hash with block_size 16:
block_hash with block_size 32:
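As a rough illustration of that merging step (merge_block_hashes and the sha256-based hashing below are stand-ins, not vLLM's actual block hashing), assuming the larger block_size is an integer multiple of the smaller one:

# Illustrative sketch only: derive block hashes for a larger block_size from
# the hashes computed at a smaller block_size.
from hashlib import sha256

def merge_block_hashes(small_hashes: list[bytes], ratio: int) -> list[bytes]:
    # Each larger-block hash covers `ratio` consecutive smaller blocks; a
    # trailing partial group is dropped, since only full blocks are hashed.
    merged = []
    for i in range(0, len(small_hashes) - ratio + 1, ratio):
        merged.append(sha256(b"".join(small_hashes[i:i + ratio])).digest())
    return merged

# Block hashes for block_size 16 -> block hashes for block_size 32 (ratio 2).
h16 = [sha256(f"block-{i}".encode()).digest() for i in range(5)]
h32 = merge_block_hashes(h16, ratio=2)  # 2 merged hashes; the 5th is dropped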
Note: for non-hybrid models with different hidden sizes per layer, like #22432, we may still keep the block size the same for all layers. I plan to do that in a future PR.
Test Plan
Set the kv_dtype of either sliding window attention or full attention to fp8 and run
And also run the necessary unit tests.
Test Result
Success
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.